Install & Load Libraries
The first step is to install and load all the libraries needed for our analysis, which is done by running the script below.
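The loading script itself is not reproduced in the output; a minimal sketch, with the package list inferred from the startup messages that follow (an assumption, so adjust it to your setup), could look like this:

```r
# Package list inferred from the startup messages (an assumption, not the
# author's exact script); readxl, scales and jtools are guesses based on
# functions used later (read_excel, rescale, plot_summs)
pkgs <- c("tseries", "quantmod", "lubridate", "highcharter", "mice", "Amelia",
          "tidyr", "leaflet", "car", "plotly", "data.table", "ggplot2",
          "readxl", "scales", "jtools")
# install.packages(pkgs)  # uncomment on first run
# require() warns instead of erroring when a package is missing
loaded <- sapply(pkgs, require, character.only = TRUE, quietly = TRUE)
```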
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:mice':
##
## complete
##
## Attaching package: 'leaflet'
## The following object is masked from 'package:xts':
##
## addLegend
## Loading required package: carData
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following objects are masked from 'package:xts':
##
## first, last
## [1] "[3.4s]: All necessary packages installed and loaded"
LOAD DATA
To load stock data from Yahoo, we use the get.hist.quote function, as we only need the dates and the adjusted close prices for our analysis. This way we retrieve exactly the required information, without having to download everything and subset!
nasdaq_symbols <- read_excel("dataset/nasdaq_symbols.xlsx")
all_stocks <- NULL
increment <- 1
for (i in nasdaq_symbols$Symbol){
temp_stocks <- get.hist.quote(instrument = i,quote = 'AdjClose', provider = 'yahoo', compression = 'd', quiet = TRUE, retclass = 'zoo')
if(increment == 1){
all_stocks <- temp_stocks
}
else{
all_stocks <- merge(all_stocks,temp_stocks)
}
increment <- increment + 1
}
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
##
## WARNING: There have been significant changes to Yahoo Finance data.
## Please see the Warning section of '?getSymbols.yahoo' for details.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.yahoo.warning"=FALSE).
CHECK FOR NAs …
Checking for NA values is necessary, as our script is intended to serve any given time frame; some stocks will simply not have closing prices for parts of that range.
# Check for NAs
summary(all_stocks)
nrow(all_stocks)
… AND DEAL WITH THEM
To deal with the NA values, which would otherwise affect the return calculations, we impute each missing price with the nearest non-NA value (carrying the last observation forward, then backward for any leading NAs).
all_stocks_complete <- all_stocks %>%
na.locf(., na.rm = FALSE) %>%
na.locf(., fromLast = TRUE)
nrow(all_stocks_complete)
## [1] 7047
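As a quick toy illustration (separate from the pipeline), the two na.locf passes first carry the last observation forward and then fill any remaining leading NAs backward:

```r
library(zoo)

x <- c(NA, NA, 3.1, NA, 3.4, NA)              # hypothetical price series with gaps
forward <- na.locf(x, na.rm = FALSE)          # NA NA 3.1 3.1 3.4 3.4
filled  <- na.locf(forward, fromLast = TRUE)  # 3.1 3.1 3.1 3.1 3.4 3.4
```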
CALCULATE THE DAILY LOG RETURN
After imputing the NA values, we can calculate the daily log return. There are several benefits to using log returns, both theoretical and algorithmic:
Log-normality: handy, given that much of classical statistics presumes normality.
Approximate raw-log equality: when returns are very small (common for trades with short holding durations), log returns are close in value to raw returns.
For regression-type calculations, taking logs of values can yield better results.
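A small numeric check of the raw-log approximation, on hypothetical prices rather than the actual dataset:

```r
p <- c(100, 100.5, 101.2, 100.9)     # hypothetical daily close prices
simple <- diff(p) / head(p, -1)      # simple (raw) returns
logret <- diff(log(p))               # log returns
max(abs(simple - logret))            # the two differ only in the 5th decimal
```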
We need to address one thing though: the first row will have NAs, as there is no previous day to calculate a return from, so it is removed.
all_return_daily <- diff(log(all_stocks_complete))
all_return_daily <- all_return_daily[-1,]
ADD DAY, MONTH, YEAR
At this step, we add the Day, Month and Year as separate columns to our datasets, for possible further use.
all_stocks_complete$day <- day(index(all_stocks_complete))
all_stocks_complete$month <- month(index(all_stocks_complete))
all_stocks_complete$year <- year(index(all_stocks_complete))
all_return_daily$day <- day(index(all_return_daily))
all_return_daily$month <- month(index(all_return_daily))
all_return_daily$year <- year(index(all_return_daily))
LAST CHANGES FOR FINAL DATASET
At this stage, we make the last adjustments:
Convert the data into data.table format.
Use the melt function to change the data into a long format.
Remove scientific notation for the stock price.
Create a proper Date column, as the date currently sits in the index.
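The exact reshaping code is not shown in the output; a sketch of these steps on a toy zoo object (column names chosen to match the final dataset) might be:

```r
library(zoo)
library(data.table)

# Toy stand-in for all_stocks_complete: two symbols over three days
z <- zoo(cbind(AAPL = c(150.1, 151.2, 149.8),
               MSFT = c(250.3, 251.0, 252.4)),
         order.by = as.Date("2018-12-12") + 0:2)

options(scipen = 999)                            # avoid scientific notation
dt <- data.table(date = index(z), coredata(z))   # the date index becomes a column
long <- melt(dt, id.vars = "date",
             variable.name = "symbol", value.name = "stock_price")
```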
AND THIS IS OUR FINAL DATASET (data per day)
tail(all_stocks_complete)
## day month year symbol stock_price date
## 1: 12 12 2018 XLNX 90.54 2018-12-12
## 2: 13 12 2018 XLNX 89.13 2018-12-13
## 3: 14 12 2018 XLNX 88.82 2018-12-14
## 4: 17 12 2018 XLNX 87.18 2018-12-17
## 5: 18 12 2018 XLNX 89.23 2018-12-18
## 6: 19 12 2018 XLNX 85.22 2018-12-19
tail(all_return_daily)
## day month year symbol stock_log_return date
## 1: 12 12 2018 XLNX 0.026978790 2018-12-12
## 2: 13 12 2018 XLNX -0.015695809 2018-12-13
## 3: 14 12 2018 XLNX -0.003484095 2018-12-14
## 4: 17 12 2018 XLNX -0.018636903 2018-12-17
## 5: 18 12 2018 XLNX 0.023242393 2018-12-18
## 6: 19 12 2018 XLNX -0.045981180 2018-12-19
GETTING VOLATILITY (SD) TO PLAY
At this step we create different aggregations of the data for the mean and standard deviation. All of them can be seen in the script, but below we show the one used for our analysis.
return_mean_daily_by_symbol <- all_return_daily[, list(avg_stock_return = mean(stock_log_return),
sd_stock_return = sd(stock_log_return)),
by = list(symbol)]
return_mean_daily_by_symbol <- merge(return_mean_daily_by_symbol,nasdaq_symbols, by.x = 'symbol', by.y = 'Symbol')
## symbol avg_stock_return sd_stock_return Name
## 1: AAL 7.970379e-05 0.02913133 American Airlines Group Inc
## 2: AAPL 9.457965e-04 0.02895474 Apple Inc
## 3: ADBE 7.721930e-04 0.02998959 Adobe Inc.
## 4: ADI 6.603589e-04 0.02875743 Analog Devices Inc
## 5: ADP 5.670731e-04 0.01549486 Automatic Data Processing Inc
## ---
## 99: WDAY 1.649721e-04 0.01103229 Workday Inc
## 100: WDC 4.057916e-04 0.03892628 Western Digital Corp
## 101: WYNN 3.558244e-04 0.02311840 Wynn Resorts Ltd
## 102: XEL 4.189168e-04 0.01531325 Xcel Energy Inc
## 103: XLNX 6.591518e-04 0.03155594 Xilinx Inc
FUNCTIONS CREATED FOR RISK ANALYSIS
At this step, 2 functions are created. The first returns the stock symbol with the largest single daily log return in a given period between start and end, which we use here as a simple proxy for the riskiest move in that period.
MaximumRisk <- function(d = all_return_daily, start, end)
{
  # Use the argument d (rather than the global) and drop the needless loop
  data.period <- subset(d, d$date >= as.Date(start) & d$date <= as.Date(end))
  print(data.period[which(data.period$stock_log_return == max(data.period$stock_log_return)),
                    c('symbol', 'stock_log_return')])
}
MaximumRisk(all_return_daily, '2016-01-01', '2016-01-31')
## symbol stock_log_return
## 1: FB 0.144286
FUNCTIONS CREATED FOR RISK ANALYSIS (II)
The second returns the stock symbol with the lowest (most negative) single daily log return in a given period between start and end.
LowestRisk <- function(d = all_return_daily, start, end)
{
  # Use the argument d (rather than the global) and drop the needless loop
  data.period <- subset(d, d$date >= as.Date(start) & d$date <= as.Date(end))
  print(data.period[which(data.period$stock_log_return == min(data.period$stock_log_return)),
                    c('symbol', 'stock_log_return')])
}
LowestRisk(all_return_daily, '2016-01-01', '2016-01-31')
## symbol stock_log_return
## 1: EBAY -0.1329909
FUNCTIONS CREATED FOR PLOTTING
In this part we create functions that provide time series plots of the daily stock return. (To keep the interactive ggplotly chart fast, we subset the data from 2000 onwards.)
after_2000_return <- all_return_daily[date>'2000-01-01']
plotting_returns <- function(x){
if (x %in% nasdaq_symbols$Symbol){
ggplotly(ggplot(data=after_2000_return[after_2000_return$symbol == x,], aes(x=date, y=stock_log_return))+
geom_line(color='cornflowerblue')+
ggtitle("Daily Stock Returns") +
xlab("Time Frame") + ylab("Stock Return")+
theme_minimal()
)
}
}
PLOTTING DAILY RETURN
FUNCTIONS CREATED FOR PLOTTING (II)
In this part we create functions that provide time series plots of the daily stock price. (To keep the interactive ggplotly chart fast, we subset the data from 2000 onwards.)
after_2000_stocks <- all_stocks_complete[date>'2000-01-01']
plotting_stocks_price <- function(x){
if (x %in% nasdaq_symbols$Symbol){
ggplotly(ggplot(data=after_2000_stocks[after_2000_stocks$symbol == x,], aes(x=date,y=stock_price))+
geom_line(color='cornflowerblue')+
ggtitle("Daily Stock Price") +
xlab("Time Frame") + ylab("Stock Price")+
theme_minimal()
)
}
}
PLOTTING DAILY STOCK PRICE
LINEAR MODEL
To test the relationship between mean return and volatility, we will fit a linear regression model. First, let's look at the relationship between the two through a scatter plot.
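The scatter-plot code is not shown in the output; a sketch on a toy data frame with the same column names (since the real one needs the full pipeline) could be:

```r
library(ggplot2)

# Toy stand-in for return_mean_daily_by_symbol
toy <- data.frame(avg_stock_return = c(8.0e-05, 9.5e-04, 7.7e-04, 5.7e-04),
                  sd_stock_return  = c(0.0291, 0.0290, 0.0300, 0.0155))
p <- ggplot(toy, aes(x = sd_stock_return, y = avg_stock_return)) +
  geom_point(color = "cornflowerblue") +
  labs(x = "Volatility (sd of daily log returns)",
       y = "Mean daily log return") +
  theme_minimal()
```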
We then split our data into train and test samples, 80% and 20% respectively.
set.seed(2018)
train.size <- 0.8
train.index <- sample.int(length(return_mean_daily_by_symbol$avg_stock_return), round(length(return_mean_daily_by_symbol$avg_stock_return) * train.size))
train.sample <- return_mean_daily_by_symbol[train.index,]
test.sample <- return_mean_daily_by_symbol[-train.index,]
Develop the model with the mean return as response, and the volatility as independent variable.
model_0 <- lm(avg_stock_return~sd_stock_return, data = train.sample)
We then analyze the effect of volatility on mean stock returns. From the regression output below, mean return and volatility have a positive relationship, with a slope coefficient of 0.01492 and an intercept of 0.0001384. The slope is statistically significant (p-value of about 1.3e-05), so we can conclude that volatility has a statistically significant effect on mean stock returns; the intercept, however, is not significant at the 5% level (p ≈ 0.10).
lm_stats <- summary(model_0)
lm_stats
##
## Call:
## lm(formula = avg_stock_return ~ sd_stock_return, data = train.sample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.195e-04 -1.375e-04 -9.140e-06 1.237e-04 7.260e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.384e-04 8.343e-05 1.659 0.101
## sd_stock_return 1.492e-02 3.210e-03 4.647 1.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0002509 on 80 degrees of freedom
## Multiple R-squared: 0.2126, Adjusted R-squared: 0.2027
## F-statistic: 21.6 on 1 and 80 DF, p-value: 1.305e-05
Since point estimates carry uncertainty, let's calculate the confidence intervals for the coefficients.
confint(model_0, level=.95)
## 2.5 % 97.5 %
## (Intercept) -2.763569e-05 0.0003044437
## sd_stock_return 8.530697e-03 0.0213079420
plot_summs(model_0, scale = TRUE, plot.distributions = TRUE, inner_ci_level = 0.95)
CLUSTERING
In order to do our clustering analysis, we need to make some adjustments:
Assign the data frame to a new one (for our convenience).
Remove the text fields from the data table, for the analysis.
Take care of NULLs, if any.
return_mean_daily_0 <- return_mean_daily_by_symbol
return_mean_daily_0$symbol <- NULL
return_mean_daily_0$Name <- NULL
After that, we rescale our variables in order to get better results.
return_mean_daily_0$avg_stock_return <- rescale(return_mean_daily_0$avg_stock_return)
return_mean_daily_0$sd_stock_return <- rescale(return_mean_daily_0$sd_stock_return)
Then we implement the k-means algorithm to get our clusters. After several trials, we settle on k = 3 clusters.
results <- kmeans(return_mean_daily_0, 3)
table_cluster <-table(return_mean_daily_by_symbol$symbol, results$cluster)
head(table_cluster,10)
##
## 1 2 3
## AAL 0 0 1
## AAPL 0 0 1
## ADBE 0 0 1
## ADI 0 0 1
## ADP 0 1 0
## ADSK 0 0 1
## ALGN 0 0 1
## ALXN 0 0 1
## AMAT 0 0 1
## AMGN 1 0 0
In order to visualize our results we create several plots.
Plot 1
Plot 2
Plot 3
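Those plots are not reproduced here; one way to sketch such a cluster visualization, on toy data standing in for the rescaled returns, is:

```r
library(ggplot2)

# Toy data standing in for return_mean_daily_0 with its k-means assignments
set.seed(1)
toy <- data.frame(avg_stock_return = rnorm(30, 5e-4, 2e-4),
                  sd_stock_return  = rnorm(30, 0.025, 0.005))
toy$cluster <- factor(kmeans(toy, 3)$cluster)
p <- ggplot(toy, aes(sd_stock_return, avg_stock_return, color = cluster)) +
  geom_point(size = 2) +
  labs(x = "Volatility", y = "Mean return", color = "Cluster") +
  theme_minimal()
```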
Another thing that should be noticed and observed is the existence of outliers. Before that, though, we need to calculate two quantities:
The centers of our clusters.
The distances of the observations from their centers, keeping the 5 largest distances.
centers1 <- results$centers[results$cluster, ]
distances <- sqrt(rowSums((return_mean_daily_0 - centers1)^2))
outliers <- order(distances, decreasing=T)[1:5]
print(outliers)
## [1] 85 66 81 88 51
print(return_mean_daily_0[outliers,])
## avg_stock_return sd_stock_return
## 1: 4.247013e-05 0.04849946
## 2: 1.013815e-03 0.04350645
## 3: 4.027235e-04 0.04288946
## 4: 6.297767e-04 0.04180016
## 5: 4.982330e-04 0.04093417
So, let’s detect them through plotting!
CONCLUSION
So, taking all the above into consideration, we can make the following points:
Through the plotting functions, any stock can be explored in any desired time frame.
Through the risk functions, any stock with high or low risk can be detected, for the same (or any other) time frame.
Lastly, after we have detected our desired stock (or stocks) with the previous mechanisms, we can see which cluster group the stock lies in, in order to expand our portfolio to other stocks with common characteristics.
All in all, the purpose of my assignment was to provide anyone with the most basic and important tools to make their own analysis regarding stocks, and their own decisions about what is important (or not) and what is risky (or not) for them.